Sentence BERT

Introduction

While BERT is powerful for many NLP tasks, such as predicting a masked word in a sentence, we are sometimes interested in comparing the similarity of two sentences (or text chunks). For this purpose, we need a fixed-size sentence embedding for each sentence (or text chunk). Sentence-BERT (S-BERT) solves this problem by producing fixed-size sentence embeddings that can be compared efficiently using cosine similarity.

Simple Approach

Let’s assume we only have a plain BERT model. The easiest way to create a sentence embedding is to average the last-hidden-state embeddings of all tokens in the sentence (mean pooling). While this works to some extent, the resulting embeddings do not capture sentence-level semantic meaning very effectively, because BERT was never trained to make sentence vectors directly comparable.
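As a sketch of this mean pooling step (random values stand in for real BERT activations, and the standard 768-dimensional hidden size is assumed):

```python
import numpy as np

# Stand-in for BERT's last hidden state for one sentence:
# shape (num_tokens, hidden_size). Random values replace real activations.
rng = np.random.default_rng(0)
last_hidden_state = rng.standard_normal((12, 768))  # 12 tokens, hidden size 768

# Mean pooling: average over the token axis to get one fixed-size vector.
sentence_embedding = last_hidden_state.mean(axis=0)
print(sentence_embedding.shape)  # (768,)
```

The sentence embedding has the same dimensionality as the hidden size, regardless of how many tokens the sentence contains.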

S-BERT Approach

Sentence-BERT (S-BERT) fine-tunes BERT on sentence pair datasets using contrastive learning to create sentence embeddings that can be compared efficiently using cosine similarity.

Dataset

The most common dataset used is SNLI (Stanford Natural Language Inference), with about 570,000 premise-hypothesis pairs.

SNLI Example:

  • Premise: “A person on a horse jumps over a log”
  • Entailment: “A person is outdoors on a horse” (logically follows from the premise)
  • Contradiction: “A person is at a diner ordering an omelet” (contradicts the premise)
  • Neutral: “A person is training his horse for a competition” (could be true or false)

Training Objective

During fine-tuning, S-BERT learns to create sentence embeddings such that:

  • Similar sentences (entailment) are close together in the embedding space (cosine-sim = 1.0)
  • Dissimilar sentences (contradiction) are far apart (cosine-sim = -1.0)
  • Neutral sentences are at medium distance (cosine-sim = 0.0)
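One way to realize this objective is a regression sketch (an illustration of the idea, not necessarily the exact loss used in the paper): minimize the squared error between the cosine similarity of a sentence pair and the target value implied by its label.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Label-to-target mapping from the bullets above.
targets = {"entailment": 1.0, "contradiction": -1.0, "neutral": 0.0}

def pair_loss(emb_a, emb_b, label):
    """Squared error between cosine similarity and the target for this label."""
    return (cosine_sim(emb_a, emb_b) - targets[label]) ** 2

# Toy check: identical embeddings labelled "entailment" give zero loss.
v = np.ones(4)
print(pair_loss(v, v, "entailment"))  # 0.0
```

During fine-tuning this loss would be backpropagated through both BERT encoders (which share weights), pulling entailment pairs together and pushing contradiction pairs apart.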

As seen in the image at the top of the page, we pool the output of the BERT model to get a sentence embedding for each sentence. Then we compute the cosine similarity between the two sentence embeddings to obtain a similarity score for the sentence pair.
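Putting the two steps together (pooling, then cosine similarity), inference for a sentence pair can be sketched as follows; random arrays again stand in for real BERT outputs:

```python
import numpy as np

def mean_pool(hidden_states):
    """Average token embeddings into one fixed-size sentence embedding."""
    return hidden_states.mean(axis=0)

def cosine_sim(u, v):
    """Cosine similarity between two vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for BERT last hidden states of two sentences (different lengths).
rng = np.random.default_rng(42)
out_a = rng.standard_normal((9, 768))   # sentence A: 9 tokens
out_b = rng.standard_normal((14, 768))  # sentence B: 14 tokens

score = cosine_sim(mean_pool(out_a), mean_pool(out_b))
print(f"similarity score: {score:.3f}")
```

Because each sentence is encoded independently, the embeddings can be precomputed and cached, so comparing a new sentence against a large collection reduces to cheap cosine-similarity lookups.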